---
layout: post
title: "Implementing Machine Learning Models for Heart Disease Prediction"
subtitle: "Showcase of data extraction"
date: 2022-08-19 05:27:00 -0400
tags: jupyter_notebook machine_learning database regression data_science
background: '/img/posts/hpbg.jpg'
---
```python
import pandas as pd
import pandas_profiling as pp  # for exploratory data analysis
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
```
The data is loaded into a dataframe with pandas' `read_csv` function.

```python
df = pd.read_csv('heart.csv')
df
```
| | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 1024 | 54 | 1 | 0 | 120 | 188 | 0 | 1 | 113 | 0 | 1.4 | 1 | 1 | 3 | 0 |
1025 rows × 14 columns
```python
df.dtypes
```

```
age           int64
sex           int64
cp            int64
trestbps      int64
chol          int64
fbs           int64
restecg       int64
thalach       int64
exang         int64
oldpeak     float64
slope         int64
ca            int64
thal          int64
target        int64
dtype: object
```
The data has 14 attributes and 1025 observations. The attributes are:

- `age`: age in years
- `sex`: sex (1 = male, 0 = female)
- `cp`: chest pain type
- `trestbps`: resting blood pressure (mm Hg)
- `chol`: serum cholesterol (mg/dL)
- `fbs`: whether fasting blood sugar is greater than 120 mg/dL (1 = yes, 0 = no)
- `restecg`: resting electrocardiographic results
- `thalach`: maximum heart rate achieved
- `exang`: exercise-induced angina (1 = yes, 0 = no)
- `oldpeak`: ST depression induced by exercise relative to rest
- `slope`: slope of the peak exercise ST segment
- `ca`: number of major vessels
- `thal`: thalassemia assessment
- `target`: the diagnosis (the dependent variable)
The `thal` attribute corresponds to an assessment of thalassemia. The labels provided initially don't correspond directly to the numbers in the table. After a bit of research, though, I found that the labels map to the table values as follows: 1 = fixed defect, 2 = normal, 3 = reversible defect.
`pandas_profiling` extends the capabilities of the basic `.describe()` method, allowing a more extensive exploratory data analysis to be carried out quickly. For each column, the generated report includes the inferred type, the counts of distinct and missing values, descriptive statistics, a histogram of the distribution, and warnings about high correlations with other columns.
```python
pp.ProfileReport(df)
```
Here is a summary of the most important takeaways from the generated report:

- There are notable correlations involving the `thalach` and `ca` attributes, as well as the `thal` attribute.
- The most frequent chest pain type is typical angina (48.5%), followed by non-anginal pain (27.7%). Chest pain type is highly correlated with the `exang`, `target`, and `age` attributes.

Next, we can look at the interaction plots for the interval variables (i.e., the bivariate distributions). I'll describe what some of the plots are showing:
- Age and oldpeak interact strongly in the 40-60 year range when oldpeak is 0.
- Age and resting blood pressure interact strongly in the 50-60 year range when blood pressure is approximately 130 mmHg.
- Age and cholesterol interact strongly in the 50-60 year range when cholesterol is between 200 and 250 mg/dL.
- Age and maximum heart rate achieved interact strongly around age 60 when the maximum heart rate is between 140 and 160 BPM.
- Resting blood pressure and oldpeak interact strongly at 130 mmHg and an oldpeak of 0.
- Resting blood pressure and cholesterol interact strongly between 120-150 mmHg and 180-300 mg/dL.
- Resting blood pressure and maximum heart rate achieved interact strongly between 110-150 mmHg and 140-180 BPM.
- Cholesterol and resting blood pressure interact strongly in the 180-200 mg/dL range at 120-150 mmHg.
- Cholesterol and maximum heart rate achieved interact strongly between 200-300 mg/dL and 140-180 BPM.
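Observations like these can also be quantified rather than read off a plot, by binning both variables and counting co-occurrences per cell. A minimal sketch on made-up data (the column names mirror the dataset, but the values here are synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'age': rng.integers(30, 80, 200),
    'chol': rng.integers(150, 350, 200),
})

# Bin both variables, then count co-occurrences per (age bin, chol bin) cell
age_bins = pd.cut(toy['age'], bins=[30, 40, 50, 60, 70, 80], include_lowest=True)
chol_bins = pd.cut(toy['chol'], bins=[150, 200, 250, 300, 350], include_lowest=True)
counts = pd.crosstab(age_bins, chol_bins)

# The densest cells mark where the two variables "interact" most
print(counts)
```

The densest cell of this table is exactly the region a bivariate interaction plot highlights.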
I'm going to quickly rename the dataframe's column labels to make interpreting results easier moving forward.
```python
df.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol',
              'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved', 'exercise_induced_angina',
              'st_depression', 'st_slope', 'num_major_vessels', 'thalassemia', 'diagnosis']
df
```
| | age | sex | chest_pain_type | resting_blood_pressure | cholesterol | fasting_blood_sugar | rest_ecg | max_heart_rate_achieved | exercise_induced_angina | st_depression | st_slope | num_major_vessels | thalassemia | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 1 | 0 | 125 | 212 | 0 | 1 | 168 | 0 | 1.0 | 2 | 2 | 3 | 0 |
| 1 | 53 | 1 | 0 | 140 | 203 | 1 | 0 | 155 | 1 | 3.1 | 0 | 0 | 3 | 0 |
| 2 | 70 | 1 | 0 | 145 | 174 | 0 | 1 | 125 | 1 | 2.6 | 0 | 0 | 3 | 0 |
| 3 | 61 | 1 | 0 | 148 | 203 | 0 | 1 | 161 | 0 | 0.0 | 2 | 1 | 3 | 0 |
| 4 | 62 | 0 | 0 | 138 | 294 | 1 | 1 | 106 | 0 | 1.9 | 1 | 3 | 2 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 1 | 1 | 140 | 221 | 0 | 1 | 164 | 1 | 0.0 | 2 | 0 | 2 | 1 |
| 1021 | 60 | 1 | 0 | 125 | 258 | 0 | 0 | 141 | 1 | 2.8 | 1 | 1 | 3 | 0 |
| 1022 | 47 | 1 | 0 | 110 | 275 | 0 | 0 | 118 | 1 | 1.0 | 1 | 1 | 2 | 0 |
| 1023 | 50 | 0 | 0 | 110 | 254 | 0 | 0 | 159 | 0 | 0.0 | 2 | 0 | 2 | 1 |
| 1024 | 54 | 1 | 0 | 120 | 188 | 0 | 1 | 113 | 0 | 1.4 | 1 | 1 | 3 | 0 |
1025 rows × 14 columns
I'm also going to change the values of the categorical variables to labels that are more readily interpretable, since it was a bit annoying to keep reminding myself of what each encoded value means.
```python
df.loc[df['sex'] == 0, 'sex'] = 'female'
df.loc[df['sex'] == 1, 'sex'] = 'male'

df.loc[df['chest_pain_type'] == 0, 'chest_pain_type'] = 'typical angina'
df.loc[df['chest_pain_type'] == 1, 'chest_pain_type'] = 'atypical angina'
df.loc[df['chest_pain_type'] == 2, 'chest_pain_type'] = 'non-anginal pain'
df.loc[df['chest_pain_type'] == 3, 'chest_pain_type'] = 'asymptomatic'

df.loc[df['fasting_blood_sugar'] == 0, 'fasting_blood_sugar'] = 'lower than 120mg/ml'
df.loc[df['fasting_blood_sugar'] == 1, 'fasting_blood_sugar'] = 'greater than 120mg/ml'

df.loc[df['rest_ecg'] == 0, 'rest_ecg'] = 'normal'
df.loc[df['rest_ecg'] == 1, 'rest_ecg'] = 'ST-T wave abnormality'
df.loc[df['rest_ecg'] == 2, 'rest_ecg'] = 'left ventricular hypertrophy'

df.loc[df['exercise_induced_angina'] == 0, 'exercise_induced_angina'] = 'no'
df.loc[df['exercise_induced_angina'] == 1, 'exercise_induced_angina'] = 'yes'

df.loc[df['st_slope'] == 0, 'st_slope'] = 'upslope'
df.loc[df['st_slope'] == 1, 'st_slope'] = 'flat'
df.loc[df['st_slope'] == 2, 'st_slope'] = 'downslope'

df.loc[df['thalassemia'] == 1, 'thalassemia'] = 'fixed defect'
df.loc[df['thalassemia'] == 2, 'thalassemia'] = 'normal'
df.loc[df['thalassemia'] == 3, 'thalassemia'] = 'reversable defect'

df
```
| | age | sex | chest_pain_type | resting_blood_pressure | cholesterol | fasting_blood_sugar | rest_ecg | max_heart_rate_achieved | exercise_induced_angina | st_depression | st_slope | num_major_vessels | thalassemia | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | male | typical angina | 125 | 212 | lower than 120mg/ml | ST-T wave abnormality | 168 | no | 1.0 | downslope | 2 | reversable defect | 0 |
| 1 | 53 | male | typical angina | 140 | 203 | greater than 120mg/ml | normal | 155 | yes | 3.1 | upslope | 0 | reversable defect | 0 |
| 2 | 70 | male | typical angina | 145 | 174 | lower than 120mg/ml | ST-T wave abnormality | 125 | yes | 2.6 | upslope | 0 | reversable defect | 0 |
| 3 | 61 | male | typical angina | 148 | 203 | lower than 120mg/ml | ST-T wave abnormality | 161 | no | 0.0 | downslope | 1 | reversable defect | 0 |
| 4 | 62 | female | typical angina | 138 | 294 | greater than 120mg/ml | ST-T wave abnormality | 106 | no | 1.9 | flat | 3 | normal | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | male | atypical angina | 140 | 221 | lower than 120mg/ml | ST-T wave abnormality | 164 | yes | 0.0 | downslope | 0 | normal | 1 |
| 1021 | 60 | male | typical angina | 125 | 258 | lower than 120mg/ml | normal | 141 | yes | 2.8 | flat | 1 | reversable defect | 0 |
| 1022 | 47 | male | typical angina | 110 | 275 | lower than 120mg/ml | normal | 118 | yes | 1.0 | flat | 1 | normal | 0 |
| 1023 | 50 | female | typical angina | 110 | 254 | lower than 120mg/ml | normal | 159 | no | 0.0 | downslope | 0 | normal | 1 |
| 1024 | 54 | male | typical angina | 120 | 188 | lower than 120mg/ml | ST-T wave abnormality | 113 | no | 1.4 | flat | 1 | reversable defect | 0 |
1025 rows × 14 columns
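The chained `.loc` assignments above get the job done, but the same recoding can be expressed in one step per column with `Series.map`. A quick sketch on a toy column, reusing the same code-to-label pairs as the `thalassemia` mapping (note that code 0, which has no documented label, would map to NaN):

```python
import pandas as pd

# Same code-to-label pairs as the thalassemia recoding above
thal_labels = {1: 'fixed defect', 2: 'normal', 3: 'reversable defect'}
codes = pd.Series([2, 3, 1, 3, 2])  # toy stand-in for df['thalassemia']

labels = codes.map(thal_labels)
print(labels.tolist())
```

One `map` call per column also avoids the subtle ordering issue of chained `.loc` replacements (where an earlier assignment could collide with a later comparison).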
Much better! Let me make sure that they are now treated as categorical (object) variables.
```python
df.dtypes
```

```
age                          int64
sex                         object
chest_pain_type             object
resting_blood_pressure       int64
cholesterol                  int64
fasting_blood_sugar         object
rest_ecg                    object
max_heart_rate_achieved      int64
exercise_induced_angina     object
st_depression              float64
st_slope                    object
num_major_vessels            int64
thalassemia                 object
diagnosis                    int64
dtype: object
```
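Strictly speaking, the recoded columns are now of `object` (string) dtype, which is all `get_dummies` needs below. pandas also offers a dedicated `category` dtype that stores each distinct label only once, which is cheaper for columns with many repeated values; a sketch on a hypothetical toy column:

```python
import pandas as pd

# Toy stand-in for a recoded column like df['sex']
sex = pd.Series(['male', 'female', 'male', 'male'], dtype='category')

print(sex.dtype)                  # category
print(list(sex.cat.categories))   # the distinct labels, stored once
```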
Finally, I'll generate dummy variables for the categorical variables in order to be able to use them in the feature selection process.
```python
df = pd.get_dummies(df, drop_first=True)
df
```
| | age | resting_blood_pressure | cholesterol | max_heart_rate_achieved | st_depression | num_major_vessels | diagnosis | sex_male | chest_pain_type_atypical angina | chest_pain_type_non-anginal pain | chest_pain_type_typical angina | fasting_blood_sugar_lower than 120mg/ml | rest_ecg_left ventricular hypertrophy | rest_ecg_normal | exercise_induced_angina_yes | st_slope_flat | st_slope_upslope | thalassemia_fixed defect | thalassemia_normal | thalassemia_reversable defect |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 125 | 212 | 168 | 1.0 | 2 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 53 | 140 | 203 | 155 | 3.1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 |
| 2 | 70 | 145 | 174 | 125 | 2.6 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 3 | 61 | 148 | 203 | 161 | 0.0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 62 | 138 | 294 | 106 | 1.9 | 3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 140 | 221 | 164 | 0.0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1021 | 60 | 125 | 258 | 141 | 2.8 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1022 | 47 | 110 | 275 | 118 | 1.0 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 |
| 1023 | 50 | 110 | 254 | 159 | 0.0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1024 | 54 | 120 | 188 | 113 | 1.4 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1025 rows × 20 columns
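With `drop_first=True`, one level of each categorical column is dropped and becomes the implicit baseline, which avoids creating perfectly collinear dummy columns. A toy illustration (the column and labels echo `st_slope`, but the data is made up):

```python
import pandas as pd

toy = pd.DataFrame({'slope': ['upslope', 'flat', 'downslope', 'flat']})
dummies = pd.get_dummies(toy, drop_first=True)

# 'downslope' (first level alphabetically) is dropped as the baseline:
# a row with all-zero dummies means 'downslope'
print(list(dummies.columns))
```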
Lastly, I'll move the diagnosis column to the end in preparation for the next stage of the process.
```python
new_cols = [col for col in df.columns if col != 'diagnosis'] + ['diagnosis']
df = df[new_cols]
df
```
| | age | resting_blood_pressure | cholesterol | max_heart_rate_achieved | st_depression | num_major_vessels | sex_male | chest_pain_type_atypical angina | chest_pain_type_non-anginal pain | chest_pain_type_typical angina | fasting_blood_sugar_lower than 120mg/ml | rest_ecg_left ventricular hypertrophy | rest_ecg_normal | exercise_induced_angina_yes | st_slope_flat | st_slope_upslope | thalassemia_fixed defect | thalassemia_normal | thalassemia_reversable defect | diagnosis |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52 | 125 | 212 | 168 | 1.0 | 2 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 53 | 140 | 203 | 155 | 3.1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 70 | 145 | 174 | 125 | 2.6 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 3 | 61 | 148 | 203 | 161 | 0.0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 62 | 138 | 294 | 106 | 1.9 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1020 | 59 | 140 | 221 | 164 | 0.0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1021 | 60 | 125 | 258 | 141 | 2.8 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1022 | 47 | 110 | 275 | 118 | 1.0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 1023 | 50 | 110 | 254 | 159 | 0.0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1024 | 54 | 120 | 188 | 113 | 1.4 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1025 rows × 20 columns
```python
sns.pairplot(df, hue='diagnosis')
```

*(Output: a seaborn pairplot of every attribute pair, colored by diagnosis.)*
Perfect! Now we are ready to move forward.
We can also use `feature_selection` from sklearn to compute the chi-squared statistic for each feature. The null hypothesis of the chi-squared test is that two categorical variables are independent, so a higher chi-squared statistic is stronger evidence that a feature and the class label are dependent, making that feature more useful for classification. In other words, the highest chi-squared values represent the strongest relationships between our dependent variable (the diagnosis) and our independent variables.
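As a quick sanity check of that intuition, here is `chi2` on a tiny synthetic problem: one feature that mirrors the class perfectly and one that is constant, and therefore independent of it (toy data, not the heart dataset):

```python
import numpy as np
from sklearn.feature_selection import chi2

# Feature 0 mirrors the class exactly; feature 1 is constant (no information)
X = np.array([[1, 5],
              [1, 5],
              [0, 5],
              [0, 5]])
y = np.array([1, 1, 0, 0])

scores, pvalues = chi2(X, y)
print(scores)  # feature 0 scores high, feature 1 scores 0
```

The dependent feature gets a strictly positive score while the uninformative one scores exactly zero, which is the ranking behavior the feature selection below relies on.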
```python
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2, r_regression

data = df.copy()
X = data.iloc[:, 0:19]  # independent variable columns
y = data.iloc[:, -1]    # dependent variable column

# Apply SelectKBest with the chi-squared statistic as the scoring function
bestfeatures = SelectKBest(score_func=chi2, k=10)
fit = bestfeatures.fit(X, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)

# Concatenate the two dataframes for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Attributes', 'chi squared score']
print(featureScores.nlargest(19, 'chi squared score'))  # all 19 features, ranked by score
```
```
                                 Attributes  chi squared score
3                   max_heart_rate_achieved         650.008493
4                             st_depression         253.653461
5                         num_major_vessels         210.625919
9            chest_pain_type_typical angina         142.563300
18            thalassemia_reversable defect         141.524151
13              exercise_induced_angina_yes         130.470927
17                       thalassemia_normal         129.833983
2                               cholesterol         110.723364
0                                       age          81.425368
8          chest_pain_type_non-anginal pain          75.643418
14                            st_slope_flat          66.295938
7           chest_pain_type_atypical angina          55.917533
1                    resting_blood_pressure          45.974069
6                                  sex_male          24.373650
12                          rest_ecg_normal          13.568869
16                 thalassemia_fixed defect           8.772019
11    rest_ecg_left ventricular hypertrophy           5.888640
15                         st_slope_upslope           5.381752
10  fasting_blood_sugar_lower than 120mg/ml           0.259249
```
```python
# Apply SelectKBest with r_regression (Pearson correlation) as the scoring function
bestfeatures = SelectKBest(score_func=r_regression, k=10)
fit = bestfeatures.fit(X, y)
dfscores = pd.DataFrame(fit.scores_)
dfcolumns = pd.DataFrame(X.columns)

# Concatenate the two dataframes for better visualization
featureScores = pd.concat([dfcolumns, dfscores], axis=1)
featureScores.columns = ['Attributes', 'Pearson r score']
print(featureScores.nlargest(19, 'Pearson r score'))  # all 19 features, ranked by correlation
```

```
                                 Attributes  Pearson r score
17                       thalassemia_normal         0.519543
3                   max_heart_rate_achieved         0.422895
8          chest_pain_type_non-anginal pain         0.319504
7           chest_pain_type_atypical angina         0.255288
10  fasting_blood_sugar_lower than 120mg/ml         0.041164
15                         st_slope_upslope        -0.075227
11    rest_ecg_left ventricular hypertrophy        -0.076357
16                 thalassemia_fixed defect        -0.095541
2                               cholesterol        -0.099966
1                    resting_blood_pressure        -0.138772
12                          rest_ecg_normal        -0.160308
0                                       age        -0.229324
6                                  sex_male        -0.279501
14                            st_slope_flat        -0.349417
5                         num_major_vessels        -0.382085
13              exercise_induced_angina_yes        -0.438029
4                             st_depression        -0.438441
18            thalassemia_reversable defect        -0.479709
9            chest_pain_type_typical angina        -0.519621
```

Note that `r_regression` returns the signed Pearson correlation, so `nlargest` ranks features by signed value: strongly negative correlations, which are just as informative for classification, end up at the bottom of this list.
The results of our feature selection indicate that max_heart_rate_achieved is the best feature to use. In general, the fewer features a model has, the easier its results are to interpret. I'll choose the top 3 features by chi-squared score to use as my independent variables for now.

Let's start by defining our independent variables, which we can do by keeping only the 3 features selected above.
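Rather than typing the column names by hand, the kept features could also be pulled straight from a fitted `SelectKBest` via its `get_support` mask. A sketch on synthetic data (the column names and `k` here are illustrative, not the real dataset):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.integers(0, 100, (50, 4)),
                 columns=['f1', 'f2', 'f3', 'f4'])
y = rng.integers(0, 2, 50)

# Fit the selector, then read the names of the k highest-scoring columns
selector = SelectKBest(score_func=chi2, k=3).fit(X, y)
kept = X.columns[selector.get_support()].tolist()
print(kept)
```

This keeps the manual column list and the feature-selection results from drifting apart if the data changes.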
```python
cols_to_keep = ['max_heart_rate_achieved', 'st_depression', 'num_major_vessels']
x = df[cols_to_keep]
x
```
| | max_heart_rate_achieved | st_depression | num_major_vessels |
|---|---|---|---|
| 0 | 168 | 1.0 | 2 |
| 1 | 155 | 3.1 | 0 |
| 2 | 125 | 2.6 | 0 |
| 3 | 161 | 0.0 | 1 |
| 4 | 106 | 1.9 | 3 |
| ... | ... | ... | ... |
| 1020 | 164 | 0.0 | 0 |
| 1021 | 141 | 2.8 | 1 |
| 1022 | 118 | 1.0 | 1 |
| 1023 | 159 | 0.0 | 0 |
| 1024 | 113 | 1.4 | 1 |
1025 rows × 3 columns
And now our dependent variable is selected as follows:
```python
y = df['diagnosis']
y
```

```
0       0
1       0
2       0
3       0
4       0
       ..
1020    1
1021    0
1022    0
1023    1
1024    0
Name: diagnosis, Length: 1025, dtype: int64
```
Next, I'll start with an 80:20 split of the data into training and testing sets, which I'll then normalize via min-max normalization.
```python
from sklearn.model_selection import train_test_split  # for data splitting

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Normalize the independent variables of both sets to the [0, 1] range
X_train = (X_train - X_train.min()) / (X_train.max() - X_train.min())
X_test = (X_test - X_test.min()) / (X_test.max() - X_test.min())
```
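One caveat with normalizing each set separately: the test set's own min and max leak a little information into preprocessing. scikit-learn's `MinMaxScaler` performs the same normalization but can be fit on the training set only and then applied to the test set; a minimal sketch on toy values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_tr = np.array([[0.0], [5.0], [10.0]])
X_te = np.array([[2.5], [12.0]])  # 12 lies outside the training range

scaler = MinMaxScaler().fit(X_tr)   # learn min/max from training data only
X_tr_s = scaler.transform(X_tr)     # scaled to [0, 1]
X_te_s = scaler.transform(X_te)     # 2.5 -> 0.25; 12 -> 1.2 (outside [0, 1])
print(X_te_s.ravel())
```

Test values can fall outside [0, 1] under this scheme, which is expected: they are expressed on the training set's scale.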
Let's set up a few classifiers.

```python
# Get ML classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Get model metrics
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
```
```python
m1 = 'Logistic Regression'
lr = LogisticRegression()
model = lr.fit(X_train, y_train)
lr_predict = lr.predict(X_test)
lr_conf_matrix = confusion_matrix(y_test, lr_predict)
lr_acc_score = accuracy_score(y_test, lr_predict)
print("confusion matrix")
print(lr_conf_matrix)
print("\n")
print("Accuracy of Logistic Regression:", lr_acc_score*100, '\n')
print(classification_report(y_test, lr_predict))
```

```
confusion matrix
[[69 33]
 [23 80]]


Accuracy of Logistic Regression: 72.6829268292683 

              precision    recall  f1-score   support

           0       0.75      0.68      0.71       102
           1       0.71      0.78      0.74       103

    accuracy                           0.73       205
   macro avg       0.73      0.73      0.73       205
weighted avg       0.73      0.73      0.73       205
```
```python
m2 = 'Naive Bayes'
nb = GaussianNB()
nb.fit(X_train, y_train)
nbpred = nb.predict(X_test)
nb_conf_matrix = confusion_matrix(y_test, nbpred)
nb_acc_score = accuracy_score(y_test, nbpred)
print("confusion matrix")
print(nb_conf_matrix)
print("\n")
print("Accuracy of Naive Bayes model:", nb_acc_score*100, '\n')
print(classification_report(y_test, nbpred))
```

```
confusion matrix
[[58 44]
 [27 76]]


Accuracy of Naive Bayes model: 65.3658536585366 

              precision    recall  f1-score   support

           0       0.68      0.57      0.62       102
           1       0.63      0.74      0.68       103

    accuracy                           0.65       205
   macro avg       0.66      0.65      0.65       205
weighted avg       0.66      0.65      0.65       205
```
```python
m3 = 'Random Forest Classifier'
rf = RandomForestClassifier(n_estimators=20, random_state=12, max_depth=5)
rf.fit(X_train, y_train)
rf_predicted = rf.predict(X_test)
rf_conf_matrix = confusion_matrix(y_test, rf_predicted)
rf_acc_score = accuracy_score(y_test, rf_predicted)
print("confusion matrix")
print(rf_conf_matrix)
print("\n")
print("Accuracy of Random Forest:", rf_acc_score*100, '\n')
print(classification_report(y_test, rf_predicted))
```

```
confusion matrix
[[73 29]
 [15 88]]


Accuracy of Random Forest: 78.53658536585367 

              precision    recall  f1-score   support

           0       0.83      0.72      0.77       102
           1       0.75      0.85      0.80       103

    accuracy                           0.79       205
   macro avg       0.79      0.79      0.78       205
weighted avg       0.79      0.79      0.78       205
```
```python
m4 = 'Extreme Gradient Boost'
xgb = XGBClassifier(learning_rate=0.01, n_estimators=25, max_depth=15, gamma=0.6,
                    subsample=0.52, colsample_bytree=0.6, seed=27,
                    reg_lambda=2, booster='dart',
                    colsample_bylevel=0.6, colsample_bynode=0.5)
xgb.fit(X_train, y_train)
xgb_predicted = xgb.predict(X_test)
xgb_conf_matrix = confusion_matrix(y_test, xgb_predicted)
xgb_acc_score = accuracy_score(y_test, xgb_predicted)
print("confusion matrix")
print(xgb_conf_matrix)
print("\n")
print("Accuracy of Extreme Gradient Boost:", xgb_acc_score*100, '\n')
print(classification_report(y_test, xgb_predicted))
```

```
confusion matrix
[[73 29]
 [21 82]]


Accuracy of Extreme Gradient Boost: 75.60975609756098 

              precision    recall  f1-score   support

           0       0.78      0.72      0.74       102
           1       0.74      0.80      0.77       103

    accuracy                           0.76       205
   macro avg       0.76      0.76      0.76       205
weighted avg       0.76      0.76      0.76       205
```
```python
m5 = 'K-NeighborsClassifier'
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train, y_train)
knn_predicted = knn.predict(X_test)
knn_conf_matrix = confusion_matrix(y_test, knn_predicted)
knn_acc_score = accuracy_score(y_test, knn_predicted)
print("confusion matrix")
print(knn_conf_matrix)
print("\n")
print("Accuracy of K-NeighborsClassifier:", knn_acc_score*100, '\n')
print(classification_report(y_test, knn_predicted))
```

```
confusion matrix
[[80 22]
 [32 71]]


Accuracy of K-NeighborsClassifier: 73.65853658536585 

              precision    recall  f1-score   support

           0       0.71      0.78      0.75       102
           1       0.76      0.69      0.72       103

    accuracy                           0.74       205
   macro avg       0.74      0.74      0.74       205
weighted avg       0.74      0.74      0.74       205
```
```python
m6 = 'DecisionTreeClassifier'
dt = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=6)
dt.fit(X_train, y_train)
dt_predicted = dt.predict(X_test)
dt_conf_matrix = confusion_matrix(y_test, dt_predicted)
dt_acc_score = accuracy_score(y_test, dt_predicted)
print("confusion matrix")
print(dt_conf_matrix)
print("\n")
print("Accuracy of DecisionTreeClassifier:", dt_acc_score*100, '\n')
print(classification_report(y_test, dt_predicted))
```

```
confusion matrix
[[88 14]
 [25 78]]


Accuracy of DecisionTreeClassifier: 80.97560975609757 

              precision    recall  f1-score   support

           0       0.78      0.86      0.82       102
           1       0.85      0.76      0.80       103

    accuracy                           0.81       205
   macro avg       0.81      0.81      0.81       205
weighted avg       0.81      0.81      0.81       205
```
```python
m7 = 'GradientBoostingClassifier'
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_predicted = gbc.predict(X_test)
gbc_conf_matrix = confusion_matrix(y_test, gbc_predicted)
gbc_acc_score = accuracy_score(y_test, gbc_predicted)
print("confusion matrix")
print(gbc_conf_matrix)
print("\n")
print("Accuracy of GradientBoostingClassifier:", gbc_acc_score*100, '\n')
print(classification_report(y_test, gbc_predicted))
```

```
confusion matrix
[[85 17]
 [17 86]]


Accuracy of GradientBoostingClassifier: 83.41463414634146 

              precision    recall  f1-score   support

           0       0.83      0.83      0.83       102
           1       0.83      0.83      0.83       103

    accuracy                           0.83       205
   macro avg       0.83      0.83      0.83       205
weighted avg       0.83      0.83      0.83       205
```
```python
m8 = 'GaussianProcessClassifier'
gpc = GaussianProcessClassifier()
gpc.fit(X_train, y_train)
gpc_predicted = gpc.predict(X_test)
gpc_conf_matrix = confusion_matrix(y_test, gpc_predicted)
gpc_acc_score = accuracy_score(y_test, gpc_predicted)
print("confusion matrix")
print(gpc_conf_matrix)
print("\n")
print("Accuracy of GaussianProcessClassifier:", gpc_acc_score*100, '\n')
print(classification_report(y_test, gpc_predicted))
```

```
confusion matrix
[[67 35]
 [22 81]]


Accuracy of GaussianProcessClassifier: 72.1951219512195 

              precision    recall  f1-score   support

           0       0.75      0.66      0.70       102
           1       0.70      0.79      0.74       103

    accuracy                           0.72       205
   macro avg       0.73      0.72      0.72       205
weighted avg       0.73      0.72      0.72       205
```
```python
m9 = 'SVC'
svc = SVC()
svc.fit(X_train, y_train)
svc_predicted = svc.predict(X_test)
svc_conf_matrix = confusion_matrix(y_test, svc_predicted)
svc_acc_score = accuracy_score(y_test, svc_predicted)
print("confusion matrix")
print(svc_conf_matrix)
print("\n")
print("Accuracy of SVC:", svc_acc_score*100, '\n')
print(classification_report(y_test, svc_predicted))
```

```
confusion matrix
[[78 24]
 [30 73]]


Accuracy of SVC: 73.65853658536585 

              precision    recall  f1-score   support

           0       0.72      0.76      0.74       102
           1       0.75      0.71      0.73       103

    accuracy                           0.74       205
   macro avg       0.74      0.74      0.74       205
weighted avg       0.74      0.74      0.74       205
```
```python
m10 = 'MLPClassifier'
# Note: learning_rate_init only applies to the 'sgd' and 'adam' solvers,
# so it has no effect with solver='lbfgs'
mlp = MLPClassifier(max_iter=1000, learning_rate_init=0.00001, solver='lbfgs')
mlp.fit(X_train, y_train)
mlp_predicted = mlp.predict(X_test)
mlp_conf_matrix = confusion_matrix(y_test, mlp_predicted)
mlp_acc_score = accuracy_score(y_test, mlp_predicted)
print("confusion matrix")
print(mlp_conf_matrix)
print("\n")
print("Accuracy of MLPClassifier:", mlp_acc_score*100, '\n')
print(classification_report(y_test, mlp_predicted))
```

```
confusion matrix
[[74 28]
 [23 80]]


Accuracy of MLPClassifier: 75.1219512195122 

              precision    recall  f1-score   support

           0       0.76      0.73      0.74       102
           1       0.74      0.78      0.76       103

    accuracy                           0.75       205
   macro avg       0.75      0.75      0.75       205
weighted avg       0.75      0.75      0.75       205
```
To compare the models side by side, I'll wrap the fit/predict/score steps in a helper function:

```python
def model_assess(model, title="Default"):
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    score = round(accuracy_score(y_test, preds), 5)  # test-set accuracy
    results = pd.DataFrame([title, score]).transpose()
    results.columns = ['Method', 'Training Score']
    return score, results
```
```python
lr = LogisticRegression()
score, lr_df = model_assess(lr, "Logistic Regression")

nb = GaussianNB()
score, nb_df = model_assess(nb, "Naive Bayes")

rf = RandomForestClassifier(n_estimators=20, random_state=12, max_depth=5)
score, rf_df = model_assess(rf, "Random Forest Classifier")

xgb = XGBClassifier(learning_rate=0.01, n_estimators=25, max_depth=15, gamma=0.6,
                    subsample=0.52, colsample_bytree=0.6, seed=27,
                    reg_lambda=2, booster='dart',
                    colsample_bylevel=0.6, colsample_bynode=0.5)
score, xgb_df = model_assess(xgb, "Extreme Gradient Boost")

knn = KNeighborsClassifier(n_neighbors=10)
score, knn_df = model_assess(knn, "K-NeighborsClassifier")

dt = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=6)
score, dt_df = model_assess(dt, "Decision Tree Classifier")

gbc = GradientBoostingClassifier()
score, gbc_df = model_assess(gbc, "Gradient Boosting Classifier")

gp = GaussianProcessClassifier()
score, gp_df = model_assess(gp, "Gaussian Process Classifier")

svc = SVC()
score, svc_df = model_assess(svc, "SVC")

mlp = MLPClassifier(max_iter=1000, learning_rate_init=0.001, solver='adam')
score, mlp_df = model_assess(mlp, "MLP Classifier")
```
```python
score_df_list = [lr_df, nb_df, rf_df, xgb_df, knn_df, dt_df, gbc_df, gp_df, svc_df, mlp_df]
score_df_results = pd.concat(score_df_list,
                             ignore_index=True).sort_values('Training Score', axis=0, ascending=False)
score_df_results
```
| | Method | Training Score |
|---|---|---|
| 6 | Gradient Boosting Classifier | 0.83415 |
| 5 | Decision Tree Classifier | 0.80976 |
| 2 | Random Forest Classifier | 0.78537 |
| 3 | Extreme Gradient Boost | 0.7561 |
| 9 | MLP Classifier | 0.74146 |
| 4 | K-NeighborsClassifier | 0.73659 |
| 8 | SVC | 0.73659 |
| 0 | Logistic Regression | 0.72683 |
| 7 | Gaussian Process Classifier | 0.72195 |
| 1 | Naive Bayes | 0.65366 |
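One caveat on the ranking above: every score comes from a single 80:20 split, so small gaps between neighboring models may be within noise. Cross-validation averages over several splits and gives a more stable estimate; a sketch with the top model, run on synthetic stand-in data so the snippet is self-contained (the real notebook would pass `x` and `y` from above):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the 3-feature, binary-target data used above
X, y = make_classification(n_samples=300, n_features=3, n_informative=3,
                           n_redundant=0, random_state=42)

# 5-fold cross-validated accuracy: mean is the estimate, std is its spread
scores = cross_val_score(GradientBoostingClassifier(random_state=0), X, y, cv=5)
print(scores.mean(), scores.std())
```

If two models' mean scores differ by less than their fold-to-fold spread, the single-split ranking between them shouldn't be taken too seriously.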